
Added Python Implementation of Suffix Arrays and LCP Arrays #12171


Open: wants to merge 20 commits into base: master


@putul03 putul03 commented Oct 19, 2024

This pull request provides an implementation of Suffix Arrays and Longest Common Prefix (LCP) Arrays in Python, contributing to the open-source community as part of Hacktoberfest 2024.


Overview:

A suffix array is a fundamental data structure used in various text processing algorithms. This project implements:

  • Suffix Array Construction: Efficiently builds a suffix array by sorting all suffixes of a string and storing their starting indices.
  • LCP Array Construction: Constructs the LCP array that records the lengths of the longest common prefixes between consecutive suffixes in the sorted suffix array.
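
As an illustration of the two definitions above, here is a minimal sketch of the straightforward construction. It is not the code from this pull request; the function name `build_suffix_array` is hypothetical, and the approach (sorting suffix start indices by comparing the suffixes directly) is the simplest one, not the fastest.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the starting indices of all suffixes of `text`, sorted lexicographically.

    Hypothetical helper for illustration. Each comparison may scan a whole
    suffix, so this is fine for short strings but slow for large inputs.
    """
    return sorted(range(len(text)), key=lambda start: text[start:])


if __name__ == "__main__":
    # Sorted suffixes of "banana": a, ana, anana, banana, na, nana
    print(build_suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```

Faster constructions (for example prefix doubling) exist; the sorted-slices version is only meant to make the definition concrete.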

Key Features:

  • Efficient Construction: The suffix array is built by sorting the suffixes, and the LCP array is then constructed in linear time from the sorted result.
  • User-friendly Display: Clearly shows both the suffix and LCP arrays for any input string, aiding in visualization.
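
The linear-time LCP step mentioned above is commonly done with Kasai's algorithm. The sketch below assumes that technique; the pull request text does not say which algorithm it uses, and the function names are hypothetical. It follows the convention of the example output in this description, where entry `i` compares the `i`-th sorted suffix with the one before it and the first entry is 0.

```python
def build_suffix_array(text: str) -> list[int]:
    # Simple sort-based construction (hypothetical helper, not linear time).
    return sorted(range(len(text)), key=lambda start: text[start:])


def build_lcp_array(text: str, suffix_array: list[int]) -> list[int]:
    """Kasai-style LCP construction, O(n) once the suffix array is known."""
    n = len(text)
    # rank[start] = position of the suffix beginning at `start` in sorted order.
    rank = [0] * n
    for position, start in enumerate(suffix_array):
        rank[start] = position

    lcp = [0] * n
    matched = 0  # length of the common prefix carried over between iterations
    for start in range(n):
        if rank[start] == 0:
            # The first sorted suffix has no predecessor to compare with.
            matched = 0
            continue
        previous = suffix_array[rank[start] - 1]
        while (
            start + matched < n
            and previous + matched < n
            and text[start + matched] == text[previous + matched]
        ):
            matched += 1
        lcp[rank[start]] = matched
        # Dropping the first character shortens the shared prefix by at most one.
        if matched > 0:
            matched -= 1
    return lcp


if __name__ == "__main__":
    sa = build_suffix_array("banana")
    print(sa)                             # [5, 3, 1, 0, 4, 2]
    print(build_lcp_array("banana", sa))  # [0, 1, 3, 0, 0, 2]
```

The counter `matched` only decreases by one per outer iteration, which is what bounds the total character comparisons linearly.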

Why this Contribution?

As part of Hacktoberfest 2024, this contribution aims to assist developers working with text-processing algorithms. This implementation serves as a foundation for more advanced algorithms in fields such as bioinformatics, data compression, and natural language processing.

Example Output:

For the input string "banana", the program generates:

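  • Suffix Array: [5, 3, 1, 0, 4, 2] (the sorted suffixes are "a", "ana", "anana", "banana", "na", "nana")
  • LCP Array: [0, 1, 3, 0, 0, 2] (each entry is the length of the common prefix shared with the preceding sorted suffix; the first entry is 0 because it has no predecessor)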

Why Suffix Arrays and LCP Arrays Matter:

  • Text Searching: Essential for fast substring searching.
  • Pattern Detection: Highlights repeated patterns within the text, useful in data compression.
  • Bioinformatics: Critical for genome sequencing and alignment algorithms.
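
To make the text-searching use case concrete, here is a minimal sketch of substring search by binary search over a suffix array. It is an illustration under assumptions, not part of the submitted file: `build_suffix_array` and `find_occurrences` are hypothetical names, and the suffix array is built with a simple sort.

```python
def build_suffix_array(text: str) -> list[int]:
    # Simple sort-based construction (hypothetical helper).
    return sorted(range(len(text)), key=lambda start: text[start:])


def find_occurrences(text: str, pattern: str, suffix_array: list[int]) -> list[int]:
    """Return every start index of `pattern` in `text`, in increasing order.

    All suffixes that begin with `pattern` are adjacent in the suffix array,
    so two binary searches locate the block.
    """
    m = len(pattern)

    def prefix(position: int) -> str:
        # First m characters of the suffix at this sorted position.
        start = suffix_array[position]
        return text[start:start + m]

    # Lower bound: first sorted position whose prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first sorted position whose prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(find_occurrences(text, "ana", sa))  # [1, 3]
    print(find_occurrences(text, "xyz", sa))  # []
```

Each search step compares at most `len(pattern)` characters, so a lookup costs roughly O(m log n) after the array is built.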


Describe your change:

  • Add an algorithm.
  • Fix a bug or typo in an existing algorithm.
  • Add or change doctests? -- Note: Please avoid changing both code and tests in a single pull request.
  • Documentation change.

Checklist:

  • I have read CONTRIBUTING.md.
  • This pull request is all my own work -- I have not plagiarized.
  • I know that pull requests will not be merged if they fail the automated tests.
  • This PR only changes one algorithm file. To ease review, please open separate PRs for separate algorithms.
  • All new Python files are placed inside an existing directory.
  • All filenames are in all lowercase characters with no spaces or dashes.
  • All functions and variable names follow Python naming conventions.
  • All function parameters and return values are annotated with Python type hints.
  • All functions have doctests that pass the automated testing.
  • All new algorithms include at least one URL that points to Wikipedia or another similar explanation.
  • If this pull request resolves one or more open issues then the description above includes the issue number(s) with a closing keyword: "Fixes #ISSUE-NUMBER".

putul03 and others added 14 commits October 19, 2024 12:05
This code file provides an implementation of Suffix Arrays and Longest Common Prefix (LCP) Arrays in Python, designed as a contribution to the open-source community during Hacktoberfest 2024.

Overview:
A suffix array is an essential data structure used in many string-processing algorithms. It provides an efficient way to store and sort all possible suffixes of a given string. This project also includes the construction of the LCP array, which records the lengths of the longest common prefixes between consecutive suffixes in the sorted suffix array. Together, these two arrays form the backbone of many algorithms in text processing and pattern matching.

Key Features:
  • Suffix Array Construction: A suffix array is built by sorting all suffixes of the input string in lexicographical order and storing their starting indices.
  • LCP Array Construction: The LCP array is computed using an efficient algorithm that compares consecutive suffixes from the suffix array and records the length of their common prefixes.
  • Optimized Approach: The approach used in this implementation ensures efficient computation of both suffix and LCP arrays with a linear-time construction of the LCP array following the suffix sorting.
  • User-friendly Display: The program clearly displays both the suffix and LCP arrays, allowing users to easily visualize and understand the results for any given input string.

Why this Contribution?
As part of Hacktoberfest 2024, I wanted to contribute something that could be useful for developers and researchers working with text-processing algorithms. This implementation not only helps in better understanding of basic string operations but also serves as a building block for more complex algorithms in fields like bioinformatics, data compression, and natural language processing.

Example Output:
For the input string "banana", the program generates the following arrays:

  • Suffix Array: [5, 3, 1, 0, 4, 2] (indicating the starting indices of the lexicographically sorted suffixes)
  • LCP Array: [0, 1, 3, 0, 0, 2] (showing the lengths of the longest common prefixes between consecutive suffixes)

Why Suffix Arrays and LCP Arrays Matter:
  • Text Searching: Suffix arrays are used in algorithms for fast substring searching, making them invaluable in tasks like searching through large databases or text files.
  • Repetitive Patterns: The LCP array highlights repeated patterns within the text, which can be useful in applications like data compression, where redundancy needs to be minimized.
  • Bioinformatics: These arrays are critical for genome sequencing and alignment algorithms, where comparing large sequences efficiently is necessary.
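
As a small illustration of the repeated-pattern point, the longest substring that occurs at least twice corresponds to the largest LCP value between neighbouring sorted suffixes. The sketch below is an assumption-labelled example, not code from this commit; `longest_repeated_substring` is a hypothetical name, and the LCP values are computed by direct comparison for clarity instead of in linear time.

```python
def longest_repeated_substring(text: str) -> str:
    """Return a longest substring occurring at least twice (overlaps allowed)."""
    n = len(text)
    # Simple sort-based suffix array (hypothetical helper logic).
    suffix_array = sorted(range(n), key=lambda start: text[start:])

    best_length, best_start = 0, 0
    for i in range(1, n):
        a, b = suffix_array[i - 1], suffix_array[i]
        # Common prefix length of two neighbouring sorted suffixes.
        length = 0
        while a + length < n and b + length < n and text[a + length] == text[b + length]:
            length += 1
        if length > best_length:
            best_length, best_start = length, b
    return text[best_start:best_start + best_length]


if __name__ == "__main__":
    print(longest_repeated_substring("banana"))  # ana
```

For "banana" the maximum LCP entry is 3, which matches the repeated substring "ana" at indices 1 and 3.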

How to Use:
This implementation is easy to run with any input string, and users can quickly get a clear visualization of the suffix and LCP arrays. Whether you're new to algorithms or looking to expand your toolkit for more advanced string manipulation tasks, this project provides a solid foundation.

@algorithms-keeper algorithms-keeper bot added the "awaiting reviews" (This PR is ready to be reviewed) and "tests are failing" (Do not merge until tests pass) labels Oct 19, 2024
@algorithms-keeper algorithms-keeper bot removed the "tests are failing" (Do not merge until tests pass) label Oct 19, 2024

@putul03 putul03 (Author) commented Oct 19, 2024

I would greatly appreciate your feedback on my Python implementation of Suffix Arrays and LCP Arrays. Your insights would mean a lot!

@algorithms-keeper algorithms-keeper bot added the "require tests" (Tests [doctest/unittest/pytest] are required) label Oct 19, 2024

@algorithms-keeper algorithms-keeper bot left a comment


Click here to look at the relevant links ⬇️

🔗 Relevant Links

Repository:

Python:

Automated review generated by algorithms-keeper. If there's any problem regarding this review, please open an issue about it.

algorithms-keeper commands and options

algorithms-keeper actions can be triggered by commenting on this PR:

  • @algorithms-keeper review to trigger the checks for only added pull request files
  • @algorithms-keeper review-all to trigger the checks for all the pull request files, including the modified files. As we cannot post review comments on lines not part of the diff, this command will post all the messages in one comment.

NOTE: Commands are in beta and so this feature is restricted only to a member or owner of the organization.

@algorithms-keeper algorithms-keeper bot removed the "require tests" (Tests [doctest/unittest/pytest] are required) label Oct 19, 2024
Labels: "awaiting reviews" (This PR is ready to be reviewed)

1 participant